Nature of Driving Force for Protein Folding:
A Result From Analyzing the Statistical Potential

In a statistical approach to protein structure analysis, Miyazawa and
Jernigan (MJ) derived a 20 x 20 matrix of inter-residue contact energies
between different types of amino acids. Using the method of eigenvalue
decomposition, we find that the MJ matrix can be accurately reconstructed
from its first two principal component vectors as
$M_{ij} = C_0 + C_1(q_i + q_j) + C_2 q_i q_j$, with constant C's, and 20 q
values associated with the 20 amino acids. This regularity is due to
hydrophobic interactions and a force of demixing, the latter obeying
Hildebrand's solubility theory of simple liquids.

Proteins fold into specific three-dimensional structures to perform their
diverse biological functions. It is now well established that for small
proteins the information contained in the amino acid sequence is sufficient
to determine the folded structure, which is the structure with minimum free
energy [1]. Thus the native structure is dictated by the physical
interactions between amino acids in the sequence, and understanding the
nature of such interactions is crucial for protein structure prediction.

As a protein contains thousands of atoms and interacts with a huge number
of water molecules, it is not feasible to calculate the free energy
function from first principles. An often adopted practical approach is to
derive a coarse-grained potential (often at the level of amino acids) using
the known structures in the existing protein data banks. In such an
approach, the energy of a particular substructure in proteins is derived
from the number of its appearances in the structure data bank via a
Boltzmann factor [2-5]. A classic example of such a statistical potential
is the Miyazawa-Jernigan (MJ) matrix, a 20 x 20 inter-residue
contact-energy matrix derived by Miyazawa and Jernigan [4,5]. This matrix
tabulates the interaction strength between any two types of amino acids in
proteins, and has been widely applied in protein design and folding
simulations [6,7].
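
The idea of turning occurrence counts into energies via a Boltzmann factor can be illustrated with a minimal sketch. It uses the generic inverse-Boltzmann relation e = -ln(N_observed / N_expected) in units of k_B T; the function name, the reference count `n_expected`, and the numbers are assumptions made for illustration, and this is not the specific procedure used by Miyazawa and Jernigan.

```python
import math

def boltzmann_energy(n_observed: float, n_expected: float) -> float:
    """Energy in units of k_B T from an observed/expected count ratio."""
    if n_observed <= 0 or n_expected <= 0:
        raise ValueError("counts must be positive")
    return -math.log(n_observed / n_expected)

# Hypothetical counts, for illustration only (not taken from any data bank).
e_favorable = boltzmann_energy(n_observed=200.0, n_expected=100.0)
e_neutral = boltzmann_energy(n_observed=100.0, n_expected=100.0)
e_unfavorable = boltzmann_energy(n_observed=50.0, n_expected=100.0)

# Over-represented substructures get negative (favorable) energies,
# under-represented ones get positive (unfavorable) energies.
print(e_favorable, e_neutral, e_unfavorable)
```

The choice of reference state (here `n_expected`) is where statistical potentials differ from one another, so the sketch leaves it as an input.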

In this letter, we apply a general method of matrix analysis, namely,
eigenvalue decomposition, to the MJ matrix [8]. The analysis reveals an
intrinsic regularity of the MJ matrix, which yields basic information about
the nature of the driving force for protein folding. We show that despite
the complicated interactions in proteins, the major driving forces are the
hydrophobic interaction and a force of demixing, the latter obeying
Hildebrand's solubility theory of simple liquids [9]. The result allows us
to attribute the interactions responsible for folding to quantifiable
properties of individual amino acids. These properties suggest further
experimental tests, and can be used for analyzing sequence-structure
relations.

Eigenvalue decomposition is a general approach to analyzing matrices. A
given N x N real symmetric matrix M can be reconstructed by the following
formula

$$M_{ij} = \sum_{\alpha=1}^{N} \lambda_\alpha V_{\alpha,i} V_{\alpha,j}, \tag{1}$$

where $M_{ij}$ is the element of the matrix in row i and column j,
$\lambda_\alpha$ is the $\alpha$th eigenvalue, and $V_{\alpha,i}$ is the
ith component of the corresponding eigenvector. We have analyzed the MJ
matrix using eigenvalue decomposition.
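
As a minimal sketch of Eq. (1), the following rebuilds a small symmetric matrix from its eigenvalues and eigenvectors, and then keeps only the two eigenvalues of largest magnitude. The use of NumPy and the 3 x 3 matrix are assumptions made for illustration; the matrix is not the MJ matrix.

```python
import numpy as np

# Small real symmetric matrix, made up purely for illustration.
M = np.array([[ 2.0, -1.0, 0.5],
              [-1.0,  3.0, 0.0],
              [ 0.5,  0.0, 1.0]])

# eigh is NumPy's solver for symmetric matrices: eigenvalues are returned
# in ascending order, and column a of V is the eigenvector for lam[a].
lam, V = np.linalg.eigh(M)

# Eq. (1): M_ij = sum over a of lam_a * V_{a,i} * V_{a,j}
M_rebuilt = sum(lam[a] * np.outer(V[:, a], V[:, a]) for a in range(len(lam)))

# Keeping only the two eigenvalues of largest magnitude gives a
# rank-2 approximation of M.
top2 = np.argsort(np.abs(lam))[-2:]
M_rank2 = sum(lam[a] * np.outer(V[:, a], V[:, a]) for a in top2)

print(np.allclose(M, M_rebuilt))
print(np.abs(M - M_rank2).max())
```

The error of the truncated sum is controlled by the discarded eigenvalues, which is why a spectrum with two dominant eigenvalues permits an accurate two-term reconstruction.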
First, we subtract the mean $\langle M_{ij} \rangle$ from each element and
then analyze the eigenvalue spectrum of the remaining matrix. We find that
the eigenvalue spectrum has two dominant eigenvalues which are much larger
in magnitude than the rest. Specifically, we find $\lambda_1 = -22.49$,
$\lambda_2 = 18.62$, while the rest of the eigenvalues have absolute values
between 2.17 and 0.013. This suggests (as we shall demonstrate below) that
the matrix can be accurately reconstructed using only the first two
eigenvectors,
$M_{ij} = \langle M_{ij} \rangle + \lambda_1 V_{1,i} V_{1,j} + \lambda_2 V_{2,i} V_{2,j}$.
Further analysis shows that the second eigenvector is related to the first
one by a shift and rescaling, i.e., $V_{2,i} = \beta + \gamma V_{1,i}$,
with $\beta = -0.30$, $\gamma = -0.90$, and a correlation coefficient of
0.986. Using this relation, the expression for $M_{ij}$ can be written
simply as

$$M_{ij} = C_0 + C_1(q_i + q_j) + C_2 q_i q_j, \tag{2}$$

where $q_i = V_{1,i}$, and the C's are constants, $C_0 = -1.492$,
$C_1 = 5.030$, $C_2 = -7.400$. Thus we can reconstruct the MJ matrix (which
in principle could have 210 independent elements) by using only twenty
parameters associated with the twenty amino acids, and three interaction
coefficients. Such a simple interaction form is often the starting point
for theoretical modeling of proteins [10].
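
The step from the two-eigenvector form to Eq. (2) can be checked with a little arithmetic. Substituting $V_{2,i} = \beta + \gamma q_i$ and collecting powers of q gives the coefficients shown in the comments below. This is a sketch of that bookkeeping using only the numbers quoted in the text; the variable names are ours, and the implied mean is an inference, since the mean itself is not quoted.

```python
# Values quoted in the text.
lam1, lam2 = -22.49, 18.62       # the two dominant eigenvalues
beta, gamma = -0.30, -0.90       # V_{2,i} = beta + gamma * V_{1,i}
C0, C1, C2 = -1.492, 5.030, -7.400

# With q = V_1, the two-eigenvector form becomes
#   <M> + lam1*q_i*q_j + lam2*(beta + gamma*q_i)*(beta + gamma*q_j),
# so the coefficients are:
#   constant term : <M> + lam2 * beta**2
#   (q_i + q_j)   : lam2 * beta * gamma
#   q_i * q_j     : lam1 + lam2 * gamma**2
C1_derived = lam2 * beta * gamma
C2_derived = lam1 + lam2 * gamma ** 2

# The mean <M> is not quoted in the text; this is only the value implied
# by the quoted C0 under the relation above.
mean_implied = C0 - lam2 * beta ** 2

print(round(C1_derived, 3), round(C2_derived, 3), round(mean_implied, 3))
```

The derived coefficients agree with the quoted $C_1$ and $C_2$ to within about 0.01, which is the level of agreement one would expect given that $\beta$ and $\gamma$ are quoted to two significant figures.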

The spectrum of the MJ matrix (two large eigenvalues with corresponding
eigenvectors related to each other) reflects the specific physical
interaction between the amino acids. The connection between the
interaction and the spectrum can be understood in the following general
way: Consider a pairwise interaction matrix $M_{ij}$ which is determined by
certain properties of two species i and j, denoted by $q_i$ and $q_j$.
Assume, on physical grounds, that $M_{ij}$ can be expressed as an
analytical function $f(q_i, q_j)$ with a well-defined converging power
series,

$$f(q_i, q_j) = C_0 + C_1(q_i + q_j) + C_2 q_i q_j + C_3(q_i^2 + q_j^2) + C_4(q_i^2 q_j + q_i q_j^2) + \cdots,$$

where the C's are constants. Take first the example where the expansion
ends at the $C_2$ term, i.e., $M_{ij} = C_0 + C_1(q_i + q_j) + C_2 q_i q_j$.
Since any row of the matrix M is given by a vector
$U_i = (C_0 + C_1 q_i) I + (C_1 + C_2 q_i) Q$, which is a linear combination
of I and Q, where $I = \{1, 1, \ldots, 1\}$ and
$Q = \{q_1, q_2, \ldots, q_n\}$, one can decompose the vector space $\Omega$
into the subspace $\Omega_\parallel$ spanned by I and Q, and its
perpendicular complement $\Omega_\perp$. It is obvious that $\Omega_\perp$
gives rise to n - 2 zero eigenvalues, as $M \cdot V_\perp = 0$ for any
vector $V_\perp$ in the subspace $\Omega_\perp$. Furthermore, the two
eigenvectors with nonzero eigenvalues must be expressible as linear
combinations of I and Q, therefore they are related to each other by a
shift and rescaling. Similarly, if the expansion ends at the $C_4$ term,
there will be three nonzero eigenvalues, and the corresponding
eigenvectors will lie in a subspace spanned by I, Q, and $Q^2$, where
$Q^2 = \{q_1^2, q_2^2, \ldots, q_n^2\}$. The same argument applies to all
higher order expansions. This analysis applies to the ideal case where
there is no noise in the matrix. Introducing noise leads to a slight mixing
of $\Omega_\perp$ and $\Omega_\parallel$ and therefore to small nonzero
values for the rest of the eigenvalue spectrum.

FIG. 1. Correlation between the original matrix elements $M_{ij}$ and the
matrix elements reconstructed from Eq. (2). The regression line is
y = 0.999x - 0.008. The correlation coefficient is 0.989. Inset: the
distribution of the MJ matrix elements. The unit of energy is $k_B T$.
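
The rank argument given before Fig. 1 can be checked numerically. The sketch below builds a noise-free matrix of the Eq. (2) form from arbitrary q values, constructs a vector orthogonal to both I and Q by Gram-Schmidt projection, and confirms that the matrix maps it to zero. The matrix size, the random q values, and the helper names are assumptions made for illustration.

```python
import random

random.seed(0)
n = 6
C0, C1, C2 = -1.492, 5.030, -7.400                  # constants quoted for Eq. (2)
q = [random.uniform(-0.5, 0.5) for _ in range(n)]   # arbitrary q values

# Noise-free matrix of the Eq. (2) form.
M = [[C0 + C1 * (q[i] + q[j]) + C2 * q[i] * q[j] for j in range(n)]
     for i in range(n)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def remove_component(v, u):
    """Return v with its projection onto u subtracted (one Gram-Schmidt step)."""
    scale = dot(v, u) / dot(u, u)
    return [x - scale * y for x, y in zip(v, u)]

ones = [1.0] * n
q_perp = remove_component(q, ones)      # part of Q orthogonal to I

# Build a random vector orthogonal to both I and Q.
v = [random.uniform(-1.0, 1.0) for _ in range(n)]
v = remove_component(v, ones)
v = remove_component(v, q_perp)

Mv = [dot(row, v) for row in M]
largest = max(abs(x) for x in Mv)
print(largest)   # zero up to floating-point rounding
```

Any vector built this way lies in the perpendicular subspace, so it is an eigenvector with eigenvalue zero; only the two directions spanned by I and Q can carry nonzero eigenvalues.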

The reconstructed matrix in Eq. (2) reproduces the original MJ matrix to a
high accuracy. Fig. 1 shows the correlation between the original MJ matrix
and the reconstructed one. The regression line is y = 0.999x + 0.008, and
the correlation coefficient is 0.989. On average Eq. (2) gives matrix
elements with only 5% error compared to the original matrix.

Notice that one can redefine the q's in Eq. (2) by a shift and rescaling
while leaving the interaction form unchanged. Therefore any transformation
$q \to Aq + B$ with a corresponding change in the C's yields an identical
matrix. To better understand the physical meaning of Eq. (2), we rewrite it
in the following form,

$$M_{ij} = h_i + h_j - C_2 (q_i - q_j)^2 / 2, \tag{3}$$

where

$$h_i = C_0/2 + C_1 q_i + (C_2/2) q_i^2. \tag{4}$$

Now each term in Eq. (3) above is invariant under the transformation
discussed above.
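
Both statements above, that Eq. (3) is the same matrix as Eq. (2) and that each of its terms is invariant under a shift and rescaling of q, can be verified directly. In the sketch below the q values, A, and B are arbitrary, and the transformed constants `c0_new`, `c1_new`, `c2_new` are worked out here by substituting q = (q' - B)/A into Eq. (2); they are not given in the text.

```python
C0, C1, C2 = -1.492, 5.030, -7.400     # constants quoted for Eq. (2)
q = [-0.30, -0.05, 0.10, 0.25]         # arbitrary illustrative q values

def m_eq2(qi, qj, c0, c1, c2):
    return c0 + c1 * (qi + qj) + c2 * qi * qj

def h(qi, c0, c1, c2):
    return c0 / 2 + c1 * qi + (c2 / 2) * qi ** 2

def m_eq3(qi, qj, c0, c1, c2):
    return h(qi, c0, c1, c2) + h(qj, c0, c1, c2) - c2 * (qi - qj) ** 2 / 2

# Redefine q -> A*q + B and change the C's so that the matrix is unchanged.
A, B = 2.0, 0.7                        # arbitrary rescaling and shift
q_new = [A * x + B for x in q]
c2_new = C2 / A ** 2
c1_new = C1 / A - C2 * B / A ** 2
c0_new = C0 - 2 * C1 * B / A + C2 * B ** 2 / A ** 2

# Eq. (2) and Eq. (3) give the same matrix elements.
max_diff_forms = max(abs(m_eq2(a, b, C0, C1, C2) - m_eq3(a, b, C0, C1, C2))
                     for a in q for b in q)
# h_i is unchanged by the transformation.
max_diff_h = max(abs(h(x, C0, C1, C2) - h(y, c0_new, c1_new, c2_new))
                 for x, y in zip(q, q_new))
# So is the quadratic term C2 * (q_i - q_j)**2.
max_diff_mix = max(abs(C2 * (a - b) ** 2 - c2_new * (A * a - A * b) ** 2)
                   for a in q for b in q)
print(max_diff_forms, max_diff_h, max_diff_mix)
```

All three differences vanish up to rounding, so $h_i$ and the quadratic term do not depend on how the q scale is normalized.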

What is the physical basis for the simple interaction form in Eq. (2)?
Consider the quantity $\chi_{ij} = 2M_{ij} - M_{ii} - M_{jj}$. Since
$M_{ij}$ is the energy of forming a contact between type i and type j
amino acids in water, $\chi_{ij}$ gives the energy of breaking one i-i
contact and one j-j contact and forming two i-j contacts; thus $\chi_{ij}$
is the energy change due to the mixing of the two types of amino acids.
According to Eq. (3), $\chi_{ij} = -C_2 (q_i - q_j)^2$. This form has a
striking similarity to the mixing energy of two simple liquids as given by
Hildebrand's solubility theory (HST) [9]. In his 1933 classic paper,
Hildebrand derived the energy of mixing of two simple liquids by summing
over the pairwise interactions throughout the mixture. Assuming that the
mixing is random and that the potentials between molecules are of the
Lennard-Jones type due to the London dispersion force, Hildebrand arrived
at a formula which expresses the energy of mixing of liquids A and B as
$E_{\rm mixing} \propto (\delta_A - \delta_B)^2$, where $\delta_{A,B}$ are
pure component properties related to the square root of the vaporization
energies of liquids A and B, traditionally called the "solubility
parameters".
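
The mixing energy can be evaluated directly from the contact energies of Eq. (2) and compared with the closed form $-C_2 (q_i - q_j)^2$. This is a minimal sketch; the two q values are arbitrary and the function names are ours.

```python
C0, C1, C2 = -1.492, 5.030, -7.400     # constants quoted for Eq. (2)

def contact_energy(qi, qj):
    """Eq. (2): contact energy between residue types with parameters qi, qj."""
    return C0 + C1 * (qi + qj) + C2 * qi * qj

def mixing_energy(qi, qj):
    """chi_ij = 2*M_ij - M_ii - M_jj."""
    return (2 * contact_energy(qi, qj)
            - contact_energy(qi, qi)
            - contact_energy(qj, qj))

qi, qj = -0.25, 0.15                   # arbitrary illustrative q values
chi = mixing_energy(qi, qj)
chi_closed_form = -C2 * (qi - qj) ** 2
print(chi, chi_closed_form)
```

Because $C_2$ is negative, the mixing energy is positive whenever the two q values differ, and it vanishes for identical types, just as the Hildebrand form vanishes for liquids with equal solubility parameters.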

Now we can imagine the formation of two i-j contacts in water as taking
place in two steps: formation of an i-i contact and a j-j contact,
followed by mixing. The energy change for the first step is
$2h_i + 2h_j$, and that for the second step is $\chi_{ij}$. As the
formation of an i-i contact in water is related to the segregation of
amino acids of type i in water, we expect that $h_i$ is related to the
hydrophobicity of amino acid i. Indeed, we find that $h_i$ correlates very
well with the hydrophobicity scales published in the literature [12] (see
Fig. 2). Thus despite the complicated interactions in proteins, we find
that the pairwise inter-residue interactions responsible for folding can be
attributed to the hydrophobic force and a force of demixing, the latter
obeying HST. (Although HST was derived for simple nonpolar molecules, it
was found previously that the theory describes well the behavior of
polymer blends [13]. The application to proteins is another example of the
more general scope of HST.)

FIG. 2. Calculated $h_i$ and measured hydrophobicities [12] of the 20
amino acids. The type of amino acid is indicated using the standard
one-letter code. The straight line is a linear fit (excluding the charged
amino acids) with slope 1.314 and intercept 0.759. The correlation
coefficient is 0.769.

The above analysis presents a simple picture of the nature of interactions
between amino acids. It also provides experimentally testable predictions.
Comparison with HST indicates that the $q_i$ we derive should be linearly
related to the solubility parameter of amino acid i, which can be measured.
Furthermore, we predict from Eq. (4) that hydrophobicity can be expressed
as a quadratic function of the solubility parameter. Since the solubility
parameter and the hydrophobicity of an amino acid can be measured
independently, this prediction can also be tested.
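
To make the second prediction explicit, suppose the linear relation takes the form $q_i = a\,\delta_i + b$, where $\delta_i$ is the solubility parameter of amino acid i. The constants a and b are placeholders introduced here for illustration; their values are not given in the text. Substituting into the expression for $h_i$ gives a quadratic in $\delta_i$:

```latex
\begin{align*}
h_i &= \frac{C_0}{2} + C_1 (a\,\delta_i + b) + \frac{C_2}{2}\,(a\,\delta_i + b)^2 \\
    &= \Bigl(\frac{C_0}{2} + C_1 b + \frac{C_2}{2}\,b^2\Bigr)
       + a\,(C_1 + C_2 b)\,\delta_i
       + \frac{C_2 a^2}{2}\,\delta_i^2 .
\end{align*}
```

Under this assumption, a measured set of solubility parameters and hydrophobicities would be expected to fall on a single parabola whose curvature has the sign of $C_2$.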

Comparison of the terms in Eq. (3) shows that the linear term $h_i + h_j$
is the dominant one in selecting the native structure. This is because the
typical difference of the linear term $\delta h$ (among different types of
contacts) is much larger than the typical difference of the square term
$\delta\chi/2$, specifically, $\delta h = 6.52\,(\delta\chi/2)$. Therefore
the energy difference between different compact structures (due to
different arrangements of the contacts) is mainly due to the linear term.
Thus, through a quantitative analysis of the MJ matrix we arrive at the
conclusion that the hydrophobic force is the dominant driving force for
protein folding [14].

The term $-C_2 (q_i - q_j)^2 / 2$ has an important consequence, however.
This term favors demixing of amino acids ($C_2$ is negative). The
microscopic basis for such a demixing force is the dissimilar
polarizability of the two [...] Since the interior of a protein is
com[...] segregate, an amino acid buried in the interior of a protein will
experience an environment which is quite different from a uniform nonpolar
environment. It has been controversial whether one can model the interior
of a protein as a uniform nonpolar environment [15]. This study suggests
that in general it is not adequate to do so.

Notice that in Fig. 2 the charged amino acids (E, D, R, K) fall into a
distinct group. Since Eq. (2) also gives accurate values for the matrix
elements involving charged amino acids, we believe our q scale captures
more information regarding folding than a simple hydrophobicity scale.
Other noticeable exceptions are the cyclic amino acid proline, and two
amino acids with aromatic side chains, tryptophan and tyrosine.

The q values we obtain can be used to characterize amino acids. The
distribution of the q values is bimodal (see Fig. 3), which supports the
notion that amino acids naturally fall into two distinct groups: "polar"
(P) and "hydrophobic" (H). This division also accounts for the three
different regions in the distribution of the MJ matrix elements (see the
inset to Fig. 1), which reflect the three possible combinations of the two
groups: polar-polar, polar-hydrophobic, hydrophobic-hydrophobic. The sharp
division between the two groups as indicated in Fig. 3 also suggests that
amino acids in the same group may play similar roles in structure
determination. There is experimental evidence to this effect insofar as
certain proteins can be designed by specifying only the HP pattern of the
sequence [16]. For the purpose of protein design, the q values can serve as
a useful scale for selecting amino acids.

FIG. 3. Distribution of q values of the 20 amino acids. The amino acids
fall into two groups: "polar", large q, and "hydrophobic", small q.

The q values can also be used to analyze the relation between sequence and
structure. In previous studies, hydrophobicity scales have been used to
analyze sequences and locate helical segments [17]. However, there exist
many different hydrophobicity scales. Our q scale has the advantage of
being more closely related to the interactions which determine structure.
We find that for a given sequence, segments with alternating large and
small q values usually correspond to $\alpha$ helices (consistent with the
previous findings using hydrophobicity scales), segments with long
stretches of large q values usually correspond to reverse turns, and
segments with long stretches of small q values usually correspond to
$\beta$ strands. An example is shown in Fig. 4 of the 3D structure of the
protein flavodoxin [18], with amino acids color coded according to their q
values.
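
As a rough sketch of how a q scale might be used to scan a sequence for the patterns just described, the code below maps residues to "small q" or "large q" and labels sliding windows. Everything in it is an assumption made for illustration: the `TOY_Q` values are hypothetical and are not the q values derived from the MJ matrix, and the threshold, window width, and labels are arbitrary choices. It only flags patterns and does not predict secondary structure.

```python
from typing import Dict, List

# Hypothetical q values for a few residue types, chosen only to make the
# example run; they are NOT the values derived from the MJ matrix.
TOY_Q: Dict[str, float] = {"L": -0.30, "V": -0.25, "I": -0.28,
                           "K": 0.20, "E": 0.22, "S": 0.10}
THRESHOLD = 0.0   # assumed cut between "small q" and "large q"

def q_pattern(sequence: str) -> str:
    """Map each residue to 's' (small q) or 'l' (large q)."""
    return "".join("l" if TOY_Q[res] > THRESHOLD else "s" for res in sequence)

def label_window(pattern: str) -> str:
    """Crude label for one window of an s/l pattern."""
    if set(pattern) == {"l"}:
        return "stretch of large q"
    if set(pattern) == {"s"}:
        return "stretch of small q"
    return "mixed large and small q"

def scan(sequence: str, width: int = 4) -> List[str]:
    """Label every window of the given width along the sequence."""
    pattern = q_pattern(sequence)
    return [label_window(pattern[i:i + width])
            for i in range(len(pattern) - width + 1)]

pattern = q_pattern("LVIVKESKLKVE")
labels = scan("LVIVKESKLKVE")
print(pattern)
print(labels)
```

In the terms used above, windows labeled as stretches of small q would be candidates for strand-like segments, stretches of large q for turn-like segments, and mixed windows would need a further periodicity test before being associated with helices.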

To summarize, we were able to extract the regularity of the
Miyazawa-Jernigan matrix of inter-residue contact energies between amino
acids using the method of eigenvalue decomposition. The analysis reveals
that the driving force for protein folding is the hydrophobic force and a
force of demixing between amino acids. We were able to construct a
solubility scale for amino acids which can be tested experimentally. This
scale can also be used for selecting amino acids for the purpose of protein
design, and for analyzing sequence-structure relations.

FIG. 4. 3D structure of the protein flavodoxin with only the main chain
atoms plotted. Amino acids are color coded according to their q values.